Abstract

Compared to Multilayer Neural Networks with real weights, Binary Multilayer
Neural Networks (BMNNs) can be implemented more efficiently on dedicated
hardware. BMNNs have been demonstrated to be effective on binary classification
tasks with the Expectation BackPropagation (EBP) algorithm on high-dimensional
text datasets. In this paper, we investigate the capability of BMNNs using
the EBP algorithm on multiclass image classification tasks. The performance
of binary neural networks with multiple hidden layers and different numbers of
hidden units is examined on MNIST. We also explore the effectiveness of image
spatial filters and the dropout technique in BMNNs. Experimental results on the
MNIST dataset show that EBP can obtain 2.12% test error with binary weights and
1.66% test error with real weights, which is comparable to the results of the standard
BackPropagation algorithm on fully connected MNNs.


1 Introduction 

In recent years, deep neural networks (DNNs) have attracted tremendous attention from a wide range
of research areas related to signal and information processing. State-of-the-art performance has
been achieved with DNN techniques on various challenging tasks and applications, such as speech
recognition (Hinton et al., 2013), object recognition (Krizhevsky et al., 2012; Szegedy et al., 2014),
multimedia event detection (Lan et al., 2013), etc. Almost all current DNNs are real-valued-weight
Multilayer Neural Networks (RMNNs). However, effective RMNNs are often massive
and require large computational and energy resources. For example, GoogLeNet has 22 layers
with tens of thousands of hidden units (Szegedy et al., 2014). MNNs with binary weights (BMNNs)
have the advantage that they can be implemented efficiently on dedicated hardware. For example,
Karakiewicz et al. (2012) presented a chip that achieves a power efficiency of $10^{12}$
multiply-accumulates per second per mW with binary weights. Thus, it is attractive to develop
effective algorithms for BMNNs that achieve performance comparable to that of RMNNs.

Traditional MNNs are trained with BackPropagation (BP) or similar gradient descent methods.
However, BP and gradient descent methods cannot be used directly to train binary neural
networks. A straightforward workaround is to binarize the real-valued weights, but this
approach decreases the performance significantly. Recently, Soudry et al. (2014) presented the
Expectation BackPropagation (EBP) algorithm, which supports online training of MNNs with
either continuous or discrete weight values. Experiments on several large text datasets show promising
performance on binary classification tasks with binary-weighted MNNs (Soudry et al., 2014). As
an extension of the previous work by Soudry et al. (2014), in this work we study the performance
of the EBP algorithm on image classification tasks with binary-weight and real-weight MNNs. We also
investigate the effects of different factors, such as network depth, layer size and dropout strategies,
on the performance of the EBP algorithm in image classification. This study explores the possibility of
using BMNNs for multimedia supervised classification tasks.

2 Expectation Backpropagation 

In this section, we review the Expectation BackPropagation (EBP) algorithm and describe in detail
how to implement it for binary weights. Before introducing the EBP algorithm, we first
describe the general notation.

A boldfaced capital letter $\mathbf{X}$ denotes a matrix with components $X_{ij}$. A boldfaced non-capital letter
$\mathbf{x}$ denotes a column vector with components $x_i$. Besides, $\mathbf{X}_l$ denotes a matrix with components $X_{ij,l}$ and $\mathbf{x}_l$ denotes a vector with components $x_{i,l}$.
The indicator function $\mathbb{1}(A)$ equals 1 if condition $A$ holds, and 0 otherwise. We
consider a general feedforward Multilayer Neural Network (MNN) with connections only between
adjacent layers. Suppose the MNN has $L$ layers, $V_l$ is the number of hidden units in the $l$-th layer,
and $\mathcal{W} = \{\mathbf{W}_l\}_{l=1}^{L}$ is the set of weight matrices, where $\mathbf{W}_l$ is the $V_l \times V_{l-1}$ matrix between the $(l-1)$-th layer and the $l$-th layer. For
simplicity, the activation function in this study is the sign function, $\mathbf{v}_l = \mathrm{sign}(\mathbf{W}_l \mathbf{v}_{l-1})$. The output of the
network is therefore

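$$\mathbf{v}_L = g(\mathbf{v}_0, \mathcal{W}) = \mathrm{sign}(\mathbf{W}_L\,\mathrm{sign}(\mathbf{W}_{L-1}\,\mathrm{sign}(\cdots \mathbf{W}_1 \mathbf{v}_0))) \qquad (1)$$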
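As an illustration of Eq. (1), the following minimal NumPy sketch evaluates a sign-activation network layer by layer. The layer sizes, the random binary weights and the tie-breaking rule for sign(0) are assumptions made only for this example; they are not taken from the paper.

```python
import numpy as np

def sign(z):
    # Ties (z == 0) are mapped to +1; Eq. (1) does not specify this case (assumption).
    return np.where(z >= 0, 1.0, -1.0)

def forward(weights, v0):
    """Evaluate v_L = sign(W_L sign(... sign(W_1 v_0)))."""
    v = np.asarray(v0, dtype=float)
    for W in weights:              # W_1, ..., W_L in order
        v = sign(W @ v)
    return v

rng = np.random.default_rng(0)
layer_sizes = [784, 200, 10]       # hypothetical V_0, V_1, V_2
weights = [sign(rng.standard_normal((layer_sizes[l + 1], layer_sizes[l])))
           for l in range(len(layer_sizes) - 1)]
v_L = forward(weights, rng.standard_normal(layer_sizes[0]))
print(v_L)
```

Each weight matrix has shape $V_l \times V_{l-1}$, matching the definition of $\mathbf{W}_l$ above.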


Similar to supervised learning with MNNs, the task is to learn $\mathcal{W}$ for an MNN with known architecture,
given a set of labeled data pairs $D_N = \{\mathbf{x}^{(n)}, \mathbf{y}^{(n)}\}_{n=1}^{N}$ (note $D_0 = \emptyset$), where each $\mathbf{x}^{(n)} \in \mathbb{R}^{V_0}$
is a data point, and each $\mathbf{y}^{(n)} \in \{-1, +1\}^{V_L}$ is a label.

The EBP algorithm is derived within a Bayesian framework. Given the labeled dataset, the aim is
to find the weights $\mathcal{W}$ that maximize the posterior probability $P(\mathcal{W} \mid D_N)$. With the posterior, one
can obtain the most probable weight configuration, and minimize the expected zero-one loss over the
outputs using Maximum A Posteriori (MAP) estimation:

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \sum_{\mathcal{W}} \mathbb{1}\{g(\mathbf{x}, \mathcal{W}) = \mathbf{y}\}\, P(\mathcal{W} \mid D_N) \qquad (2)$$

The posterior $P(\mathcal{W} \mid D_n)$ is updated in an online setting, where samples arrive sequentially.
According to Bayes' rule, when the $n$-th sample arrives, the posterior is updated as follows. For
$n = 1, \dots, N$,

$$P(\mathcal{W} \mid D_n) \propto P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \mathcal{W})\, P(\mathcal{W} \mid D_{n-1}) \qquad (3)$$
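To make the online update in Eq. (3) concrete, the sketch below applies it exactly to a single sign neuron with a handful of binary weights by enumerating every weight configuration. The teacher vector, the Gaussian inputs and the sizes are assumptions chosen for illustration. Because the network is deterministic, the likelihood is 1 for configurations that reproduce the label and 0 otherwise.

```python
import itertools
import numpy as np

def exact_online_posterior(X, y):
    """Exact Bayes update of P(W | D_n) for one sign neuron with K binary weights."""
    K = X.shape[1]
    # All 2^K weight configurations: this table is what becomes intractable for large K.
    configs = np.array(list(itertools.product([-1.0, 1.0], repeat=K)))
    posterior = np.full(len(configs), 1.0 / len(configs))   # uniform prior P(W | D_0)
    for x_n, y_n in zip(X, y):
        outputs = np.where(configs @ x_n >= 0, 1.0, -1.0)
        likelihood = (outputs == y_n).astype(float)          # P(y | x, W) is 0 or 1
        unnormalized = likelihood * posterior                # right-hand side of Eq. (3)
        posterior = unnormalized / unnormalized.sum()
    return configs, posterior

rng = np.random.default_rng(0)
K = 5
w_teacher = rng.choice([-1.0, 1.0], size=K)                  # hypothetical "true" weights
X = rng.standard_normal((20, K))
y = np.where(X @ w_teacher >= 0, 1.0, -1.0)
configs, posterior = exact_online_posterior(X, y)
teacher_index = int(np.flatnonzero((configs == w_teacher).all(axis=1))[0])
print(len(configs), posterior[teacher_index])
```

The table of configurations doubles with every additional weight, so this exact procedure is only feasible for toy problems.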

However, this update is generally intractable for large networks, as there is an exponential number of
values of $P(\mathcal{W} \mid D_n)$ to be stored and updated. To solve this problem, the mean-field approximation
is used: $P(\mathcal{W} \mid D_n)$ is approximated by a factorized distribution $\hat{P}(\mathcal{W} \mid D_n)$, for which

$$\hat{P}(\mathcal{W} \mid D_n) = \prod_{i,j,l} \hat{P}(W_{ij,l} \mid D_n) \qquad (4)$$

where each factor is normalized. Based on this factorization, marginalizing the Bayes update of the
posterior (see Appendix A in Soudry et al. (2014) for details) and rearranging terms, we
obtain a Bayes-like update for each marginal

$$\hat{P}(W_{ij,l} \mid D_n) \propto \hat{P}(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, W_{ij,l}, D_{n-1})\, \hat{P}(W_{ij,l} \mid D_{n-1}) \qquad (5)$$

where

$$\hat{P}(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, W_{ij,l}, D_{n-1}) = \sum_{\mathcal{W}' : W'_{ij,l} = W_{ij,l}} P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \mathcal{W}') \prod_{\{k,r,m\} \neq \{i,j,l\}} \hat{P}(W'_{kr,m} \mid D_{n-1}) \qquad (6)$$

is the marginal likelihood. Accordingly, $\hat{P}(W_{ij,l} \mid D_n)$ can be directly updated in a single step. The
problem is that Eq. (6) contains a generally intractable summation over an exponential number of
values.

To simplify the summation, another approximation is made by assuming that the neuronal fan-in
is "large", namely, that a large number of units in the previous layer are connected to each unit in
the next layer. Since all the other weights besides $W_{ij,l}$ are independent (based on the mean-field
approximation), together with the large fan-in assumption, we can assume that the normalized input
to each neural layer follows a Gaussian distribution based on the Central Limit Theorem (CLT), thus

$$\forall m: \quad \mathbf{u}_m = \mathbf{W}_m \mathbf{v}_{m-1} / \sqrt{K_m} \sim \mathcal{N}(\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m) \qquad (7)$$

where $K_m$ denotes the fan-in of layer $m$.

This is a quite common and effective approximation (Ribeiro & Opper, 2011). Using this
approximation (Eq. (7)) and the activation function $\mathbf{v}_m = \mathrm{sign}(\mathbf{u}_m)$, the distributions of $\mathbf{u}_m$ and
$\mathbf{v}_m$ can be calculated sequentially for all the layers $m \in \{1, \dots, L\}$ ("forward pass"), for any
given value of $\mathbf{v}_0$ (i.e., the input) and $W_{ij,l}$. At the end of the forward pass, we obtain
$P(\mathbf{y} \mid W_{ij,l}) = P(\mathbf{v}_L = \mathbf{y} \mid W_{ij,l})$. With the obtained $P(\mathbf{y} \mid W_{ij,l})$, we can use Eq. (5) to
update $\hat{P}(W_{ij,l} \mid D_n)$.
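The quality of the Gaussian approximation in Eq. (7) can be checked numerically. The sketch below is an illustration under assumed values (a fan-in of 400, random posterior parameters and a fixed random input, none of which come from the paper): it samples independent binary weights, forms the normalized input of one first-layer unit, and compares its empirical mean, variance and sign probability with the Gaussian prediction.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
K = 400                                   # assumed fan-in
h = rng.standard_normal(K)                # hypothetical posterior parameters
v_prev = rng.standard_normal(K)           # a fixed (deterministic) input vector

# Moments predicted for u = W v / sqrt(K) with independent W_r in {-1, +1}.
mean_W = np.tanh(h)                       # <W_r>, with P(W_r = +1) = (1 + tanh(h_r)) / 2
mu = mean_W @ v_prev / np.sqrt(K)
sigma2 = ((1.0 - mean_W ** 2) * v_prev ** 2).sum() / K

# Monte Carlo estimate of the same quantities.
p_plus = 0.5 * (1.0 + mean_W)
samples = np.where(rng.random((10000, K)) < p_plus, 1.0, -1.0)
u = samples @ v_prev / np.sqrt(K)

gaussian_prob = 0.5 * erfc(-mu / sqrt(2.0 * sigma2))   # Phi(mu / sigma) = P(u > 0)
empirical_prob = (u > 0).mean()
print(mu, u.mean(), sigma2, u.var(), gaussian_prob, empirical_prob)
```

With a fan-in of a few hundred, the sampled and predicted values should agree closely, which is the regime the large fan-in assumption targets.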

Because it is computationally expensive to calculate $P(\mathbf{v}_L = \mathbf{y} \mid W_{ij,l})$ directly for every $i, j, l$, a first-order Taylor
expansion in $W_{ij,l}$ (around its mean $\langle W_{ij,l} \rangle$) is used to approximate $P(\mathbf{v}_L = \mathbf{y} \mid W_{ij,l})$.
The first-order terms in this expansion can be calculated using backward propagation of the derivative
terms

$$\Delta_{k,m} = \partial \ln P(\mathbf{v}_L = \mathbf{y}) / \partial \mu_{k,m} \qquad (8)$$

Thus, after a forward pass for $\mathbf{u}_m$ and $\mathbf{v}_m$, $m \in \{1, \dots, L\}$, and a backward pass for
$P(\mathbf{v}_L = \mathbf{y} \mid W_{ij,l})$, $l \in \{L, \dots, 1\}$, for all $W_{ij,l}$, we can update $\hat{P}(W_{ij,l} \mid D_n)$ in each training iteration. In the following, we
summarize the general Expectation BackPropagation algorithm and introduce the implementation
of the EBP algorithm using binary weights and real bias. More detailed information about the
implementation for real weights is given in Soudry et al. (2014).


2.1 The Expectation Backpropagation Algorithm 


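Given an input $\mathbf{x}$ and a desired output $\mathbf{y}$, a forward pass is first performed to calculate the mean output $\langle \mathbf{v}_l \rangle$
for each layer; then a backward pass is conducted to update $\hat{P}(W_{ij,l} \mid D_n)$ for all the weights.

Forward pass First, we initialize the MNN input $\langle v_{k,0} \rangle = x_k$ for all $k$, and then calculate recursively
the following values for $m = 1, \dots, L$ and all $k$

$$\mu_{k,m} = \frac{1}{\sqrt{K_m}} \sum_{r=1}^{V_{m-1}} \langle W_{kr,m} \rangle \langle v_{r,m-1} \rangle, \qquad \langle v_{k,m} \rangle = 2\Phi(\mu_{k,m} / \sigma_{k,m}) - 1 \qquad (9)$$

$$\sigma^2_{k,m} = \frac{1}{K_m} \sum_{r=1}^{V_{m-1}} \left( \langle W^2_{kr,m} \rangle \left( \delta_{m,1} (\langle v_{r,m-1} \rangle^2 - 1) + 1 \right) - \langle W_{kr,m} \rangle^2 \langle v_{r,m-1} \rangle^2 \right) \qquad (10)$$

where $\langle W_{kr,m} \rangle$ is the mean of the posterior distribution $\hat{P}(W_{kr,m} \mid D_n)$, $\mu_{k,m}$ and $\sigma^2_{k,m}$ are the mean and
variance of the input $u_{k,m}$ of layer $m$, and $\langle v_{k,m} \rangle$ is the resulting mean of the output of layer $m$. Here $\Phi$ denotes
the standard normal cumulative distribution function and $\delta_{m,1}$ is the Kronecker delta.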
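The forward recursion in Eqs. (9) and (10) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function and variable names are hypothetical, the fan-in $K_m$ is assumed to equal the number of columns of the weight matrix, bias terms are omitted, and the example posterior is a binary one with $\langle W \rangle = \tanh(h)$ and $\langle W^2 \rangle = 1$.

```python
import numpy as np
from math import erfc, sqrt

_erfc = np.vectorize(erfc)

def std_normal_cdf(z):
    # Phi(z) computed through erfc for accuracy in the lower tail.
    return 0.5 * _erfc(-np.asarray(z, dtype=float) / sqrt(2.0))

def forward_moments(mean_W, mean_W2, x):
    """Return [(mu_m, sigma2_m, <v_m>)] for m = 1..L following Eqs. (9)-(10)."""
    v_mean = np.asarray(x, dtype=float)          # <v_0> = x
    stats = []
    for m, (W1, W2) in enumerate(zip(mean_W, mean_W2), start=1):
        K = W1.shape[1]                          # fan-in K_m (assumption: fully connected)
        v_sq = v_mean ** 2
        mu = W1 @ v_mean / np.sqrt(K)
        # delta_{m,1}(<v>^2 - 1) + 1 equals <v>^2 in the first layer and 1 otherwise.
        second_moment_v = v_sq if m == 1 else np.ones_like(v_sq)
        sigma2 = (W2 @ second_moment_v - (W1 ** 2) @ v_sq) / K
        v_mean = 2.0 * std_normal_cdf(mu / np.sqrt(sigma2)) - 1.0
        stats.append((mu, sigma2, v_mean))
    return stats

rng = np.random.default_rng(0)
sizes = [784, 200, 10]                           # hypothetical architecture
h = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
mean_W = [np.tanh(h_l) for h_l in h]
mean_W2 = [np.ones_like(h_l) for h_l in h]
stats = forward_moments(mean_W, mean_W2, rng.standard_normal(sizes[0]))
print(stats[-1][2])
```

Note that the variance becomes zero if every weight is deterministic and the layer input is deterministic as well; a practical implementation would need to guard that division.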


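Backward pass The backward pass performs the Bayes update of the posterior (Eq. (5)) using a Taylor
expansion. Based on Eq. (8), we first initialize $\Delta_{i,L}$ for all $i$ (refer to Eq. C.9 in Soudry et al.
(2014)) as:

$$\Delta_{i,L} = y_i \frac{\mathcal{N}(0 \mid \mu_{i,L}, \sigma^2_{i,L})}{\Phi(y_i \mu_{i,L} / \sigma_{i,L})} \qquad (11)$$

Then, for $l = L, \dots, 1$ and $\forall i, j$, we calculate

$$\Delta_{i,l-1} = \frac{2}{\sqrt{K_l}}\, \mathcal{N}(0 \mid \mu_{i,l-1}, \sigma^2_{i,l-1}) \sum_{j=1}^{V_l} \langle W_{ji,l} \rangle \Delta_{j,l} \qquad (12)$$

$$\ln \hat{P}(W_{ij,l} \mid D_n) = \ln \hat{P}(W_{ij,l} \mid D_{n-1}) + \frac{1}{\sqrt{K_l}} W_{ij,l}\, \Delta_{i,l} \langle v_{j,l-1} \rangle + C \qquad (13)$$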
where $C$ is an unimportant constant that does not depend on $W_{ij,l}$.

Output Based on the learnt weight configuration $\mathcal{W}^*$, the output can be obtained as $g(\mathbf{x}, \mathcal{W}^*)$ by
Eq. (1), which is defined as the Deterministic EBP output (EBP-D) (Soudry et al., 2013). Alternatively,
the MAP output (Eq. (2)) can be calculated directly

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \ln P(\mathbf{v}_L = \mathbf{y}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} \left[ \sum_k y_k \ln \left( \frac{1 + \langle v_{k,L} \rangle}{1 - \langle v_{k,L} \rangle} \right) \right] \qquad (14)$$

using $\langle v_{k,L} \rangle$ from Eq. (9). The output of Eq. (14) is defined to be the Probabilistic EBP output (EBP-P).
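The EBP-P decision rule in Eq. (14) can be sketched as follows. The label set (ten vectors with a single +1 entry and -1 elsewhere) and the clipping constant are assumptions made for this illustration. For such a label set the score of class $c$ is $2a_c - \sum_k a_k$, where $a_k$ is the log-odds term, so the rule reduces to picking the output unit with the largest mean.

```python
import numpy as np

def ebp_p_decode(v_mean_L, label_set, eps=1e-12):
    """Return the index of the label vector maximizing the sum in Eq. (14)."""
    v = np.clip(v_mean_L, -1.0 + eps, 1.0 - eps)    # avoid log(0) for saturated units
    log_odds = np.log1p(v) - np.log1p(-v)           # ln((1 + <v>) / (1 - <v>))
    scores = label_set @ log_odds                   # sum_k y_k * log_odds_k for each y
    return int(np.argmax(scores))

num_classes = 10
label_set = 2.0 * np.eye(num_classes) - 1.0         # assumed set of admissible labels Y
rng = np.random.default_rng(0)
v_mean_L = rng.uniform(-0.99, 0.99, size=num_classes)   # hypothetical <v_{k,L}>
predicted = ebp_p_decode(v_mean_L, label_set)
print(predicted, int(np.argmax(v_mean_L)))
```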


2.2 Implementation for Binary Weights 


In the implementation of binary weights, the weight $W_{ij,l}$ can only take values in $\{-1, +1\}$. In
Soudry et al. (2014), the distribution of $W_{ij,l}$ is parameterized so that

$$\hat{P}(W_{ij,l} \mid D_n) = \frac{e^{h^{(n)}_{ij,l} W_{ij,l}}}{e^{h^{(n)}_{ij,l}} + e^{-h^{(n)}_{ij,l}}} \qquad (15)$$


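According to the forward pass (Eq. (9) and Eq. (10)), this parametrization can be used to compute
$\langle W_{ij,l} \rangle = \tanh(h_{ij,l})$, $\langle W^2_{ij,l} \rangle = 1$ and $\mathrm{Var}(W_{ij,l}) = \mathrm{sech}^2(h_{ij,l})$. In the backward pass,
substituting Eq. (15) into Eq. (13), the parameter $h_{ij,l}$ is updated in each iteration as

$$h^{(n)}_{ij,l} = h^{(n-1)}_{ij,l} + \frac{1}{\sqrt{K_l}} \Delta_{i,l} \langle v_{j,l-1} \rangle \qquad (16)$$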
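The step from the log-posterior update to the additive update of $h_{ij,l}$ can be written out explicitly. The short derivation below is a sketch added for clarity, using only the parametrization of Eq. (15) and the update of Eq. (13); $C$ collects every term that does not depend on $W_{ij,l}$.

```latex
% Log of the parametrized marginal (Eq. (15)):
\ln \hat{P}(W_{ij,l} \mid D_n) = h^{(n)}_{ij,l} W_{ij,l} - \ln\!\left(2\cosh h^{(n)}_{ij,l}\right)

% Insert this on both sides of Eq. (13):
h^{(n)}_{ij,l} W_{ij,l} - \ln\!\left(2\cosh h^{(n)}_{ij,l}\right)
  = h^{(n-1)}_{ij,l} W_{ij,l} - \ln\!\left(2\cosh h^{(n-1)}_{ij,l}\right)
  + \frac{1}{\sqrt{K_l}} W_{ij,l}\, \Delta_{i,l} \langle v_{j,l-1} \rangle + C

% Evaluate at W_{ij,l} = +1 and W_{ij,l} = -1 and subtract; all W-independent terms cancel:
2 h^{(n)}_{ij,l} = 2 h^{(n-1)}_{ij,l} + \frac{2}{\sqrt{K_l}} \Delta_{i,l} \langle v_{j,l-1} \rangle
\quad\Longrightarrow\quad
h^{(n)}_{ij,l} = h^{(n-1)}_{ij,l} + \frac{1}{\sqrt{K_l}} \Delta_{i,l} \langle v_{j,l-1} \rangle
```

The normalization term $\ln(2\cosh h)$ is absorbed into the constant, which is why only the coefficient of $W_{ij,l}$ needs to be tracked.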


Algorithm 1 shows the update steps of the EBP algorithm for a BMNN. The weight configuration for
the BMNN is obtained by simply clipping

$$W^*_{ij,l} = \mathrm{sign}(h_{ij,l}) \qquad (17)$$
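The quantities derived from the parameter $h_{ij,l}$ can be computed directly, as in the following small sketch. The numerical values of `h` are arbitrary, and mapping $h = 0$ to $+1$ when clipping is an assumption, since the text does not specify how ties are broken.

```python
import numpy as np

h = np.array([[-2.0, 0.0, 0.5],
              [3.0, -0.1, 1.5]])                      # hypothetical parameters h_{ij,l}

prob_plus = np.exp(h) / (np.exp(h) + np.exp(-h))      # P(W = +1) from the parametrization
mean_W = np.tanh(h)                                   # <W>
var_W = 1.0 / np.cosh(h) ** 2                         # Var(W) = sech^2(h)
W_star = np.where(h >= 0, 1.0, -1.0)                  # clipped binary weights

print(mean_W)
print(W_star)
```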


3 Implementation of EBP on Image Classification 

The performance of the EBP algorithm has been evaluated in Soudry et al. (2014). However, those
experiments are limited to high-dimensional text datasets (the dimensions of the input feature vectors
range from 11,463 to 238,739), and all the tasks are binary classification tasks. In this study, we
examine the performance of the EBP algorithm on image datasets for multiclass classification.
To check the performance of the EBP algorithm on deeper architectures with small "fan-in" on image
classification, we use architectures with multiple layers and different numbers of hidden units in the experiments.
Besides, we also explore the effectiveness of the dropout technique (Srivastava et al., 2014) in the EBP
algorithm.

Two methods are used to input the image into the MNNs. The first method is to directly convert
the 2D image into a 1D vector by concatenating the pixels in the image in a certain order, such as
concatenating each row from top to bottom. For example, for the standard MNIST handwritten
digits database, the input for each image is a vector of 28 × 28 = 784 elements. In the second method (spatial filtering
method), we consider the spatial configuration of the images. The spatial configuration is considered
in a similar way as in Convolutional Neural Networks (CNN) (LeCun et al., 1998). Each unit in a layer
receives inputs from a set of units located in a small neighborhood in the previous layer. As shown
in Fig. 1, a unit in the feature map has 100 inputs connected to a 10 × 10 area in the input. Each unit
therefore has 100 trainable coefficients plus a trainable bias. Different from CNN,
we only use one feature map in each hidden layer in this study¹. Since there is only one feature map,
the network does not have the constraint that the connection weights for each unit in the feature map
are the same. In the example shown in Fig. 1, there are 19 × 19 = 361 units in the second layer and
each unit has (100 + 1) trainable weights. In the implementation, the weight matrix between the first
and second layer has size 361 × 784. The weight matrix is initialized so that only the weights
for connected units are nonzero, namely, 361 × 100 nonzero elements in the weight matrix. The
zero elements are kept at zero during the whole training process. Because the EBP algorithm relies on the
assumption of large fan-in, each unit in the hidden layers (feature maps) should be connected to a
relatively large neighborhood (such as 10 × 10 or larger) in the input layer.
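The sparse weight matrix described above can be sketched with a boolean connectivity mask. The code below is an illustration under stated assumptions: pixels are flattened row by row, the block slides with a stride of one pixel, and bias terms are left out. The function name is hypothetical.

```python
import numpy as np

def local_connectivity_mask(image_size=28, block=10):
    """Boolean mask of shape (units, pixels); True marks a connected (trainable) weight."""
    out = image_size - block + 1                      # feature map side: 28 - 10 + 1 = 19
    mask = np.zeros((out * out, image_size * image_size), dtype=bool)
    for r in range(out):
        for c in range(out):
            window = np.zeros((image_size, image_size), dtype=bool)
            window[r:r + block, c:c + block] = True   # 10 x 10 receptive field
            mask[r * out + c] = window.ravel()        # row-major flattening (assumption)
    return mask

mask = local_connectivity_mask()
rng = np.random.default_rng(0)
W = rng.standard_normal(mask.shape) * mask            # only connected weights are nonzero
print(mask.shape, int(mask.sum()))
```

Multiplying any parameter update by `mask` before applying it would keep the unconnected entries at zero throughout training.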

¹ The performance of the EBP algorithm on standard CNN architectures will be studied in future work.
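Algorithm 1 Expectation BackPropagation (EBP) algorithm for fully connected binary MNNs
with binary synaptic weights and real bias (Soudry et al., 2014)

% $\bar{v}_{k,l} = \langle v_{k,l} \rangle$, $\tanh(h_{ij,l}) = \langle W_{ij,l} \rangle$, and $\mathcal{H}$ is the set of all $h_{ij,l}$.

Function $[\bar{\mathbf{v}}_L, \mathcal{H}^{\mathrm{next}}]$ = UpdateStepBinaryMNN($\mathbf{x}, \mathbf{y}, \mathcal{H}$)

  ▷ Forward pass
  Initialize:
    $\forall k: \bar{v}_{k,0} = x_k$, $\forall l: \bar{v}_{0,l} = 1$
  for $m = 1 \to L$ do
    $\forall k$:
      $\mu_{k,m} = \frac{1}{\sqrt{K_m}} \sum_{r=1}^{V_{m-1}} \tanh(h_{kr,m})\, \bar{v}_{r,m-1}$
      $\sigma^2_{k,m} = \frac{1}{K_m} \sum_{r=1}^{V_{m-1}} \left[ (1 - \bar{v}^2_{r,m-1})(1 - \delta_{1,m}) + \bar{v}^2_{r,m-1}\, \mathrm{sech}^2(h_{kr,m}) \right]$
      $\bar{v}_{k,m} = 2\Phi(\mu_{k,m} / \sigma_{k,m}) - 1$
  end for

  ▷ Backward pass
  Initialize:
    $\forall i: \Delta_{i,L} = y_i \frac{\mathcal{N}(0 \mid \mu_{i,L}, \sigma^2_{i,L})}{\Phi(y_i \mu_{i,L} / \sigma_{i,L})}$
  for $l = L \to 1$ do
    $\forall i: \Delta_{i,l-1} = \frac{2}{\sqrt{K_l}}\, \mathcal{N}(0 \mid \mu_{i,l-1}, \sigma^2_{i,l-1}) \sum_{j=1}^{V_l} \tanh(h_{ji,l})\, \Delta_{j,l}$
    $\forall i, j: h^{\mathrm{next}}_{ij,l} = h_{ij,l} + \frac{1}{\sqrt{K_l}} \Delta_{i,l}\, \bar{v}_{j,l-1}$
  end for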
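The update step of Algorithm 1 can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' implementation: bias units are omitted, the fan-in $K_l$ is assumed to be the number of columns of each parameter matrix, and a small floor on the normal CDF is added as a numerical safeguard. All function names are hypothetical.

```python
import numpy as np
from math import erfc, sqrt, pi

_erfc = np.vectorize(erfc)

def std_normal_cdf(z):
    # Phi(z) via erfc, which stays accurate in the lower tail.
    return 0.5 * _erfc(-np.asarray(z, dtype=float) / sqrt(2.0))

def normal_pdf_at_zero(mu, sigma2):
    # N(0 | mu, sigma2): Gaussian density with mean mu and variance sigma2, evaluated at 0.
    return np.exp(-mu ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * pi * sigma2)

def update_step_binary_mnn(x, y, H):
    """One EBP update for a bias-free binary MNN; H[l-1] holds the matrix of h_{ij,l}."""
    L = len(H)
    v_bar = [np.asarray(x, dtype=float)]          # v_bar[l] stores <v_l>
    mus, sigma2s = [], []
    # Forward pass
    for m in range(1, L + 1):
        h = H[m - 1]
        K = h.shape[1]                            # fan-in K_m (assumption)
        t = np.tanh(h)
        v_sq = v_bar[-1] ** 2
        not_first = 0.0 if m == 1 else 1.0        # (1 - delta_{1,m})
        mu = t @ v_bar[-1] / np.sqrt(K)
        sigma2 = (not_first * (1.0 - v_sq).sum() + (1.0 - t ** 2) @ v_sq) / K
        mus.append(mu)
        sigma2s.append(sigma2)
        v_bar.append(2.0 * std_normal_cdf(mu / np.sqrt(sigma2)) - 1.0)
    # Backward pass
    y = np.asarray(y, dtype=float)
    cdf = np.maximum(std_normal_cdf(y * mus[-1] / np.sqrt(sigma2s[-1])), 1e-300)
    delta = y * normal_pdf_at_zero(mus[-1], sigma2s[-1]) / cdf
    H_next = [None] * L
    for l in range(L, 0, -1):
        h = H[l - 1]
        K = h.shape[1]
        H_next[l - 1] = h + np.outer(delta, v_bar[l - 1]) / np.sqrt(K)
        if l > 1:                                 # propagate Delta to layer l - 1
            delta = (2.0 / np.sqrt(K)) * normal_pdf_at_zero(mus[l - 2], sigma2s[l - 2]) \
                    * (np.tanh(h).T @ delta)
    return v_bar[-1], H_next

rng = np.random.default_rng(0)
H = [0.1 * rng.standard_normal((15, 20)), 0.1 * rng.standard_normal((3, 15))]
x = rng.choice([-1.0, 1.0], size=20)              # hypothetical input
y = np.array([1.0, -1.0, -1.0])                   # hypothetical target
v_L, H_next = update_step_binary_mnn(x, y, H)
print(v_L)
```

Training would call this function once per sample, feeding `H_next` back in as `H`, and would finally clip the parameters with the sign function to obtain binary weights.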



Figure 1: A two-layer neural network architecture that considers the spatial context in images


4 Experiments 


In this section, we report experiments with the EBP algorithm on MNNs with different architecture
configurations on the standard MNIST handwritten digits database (LeCun et al., 1998).


4.1 Experiment Setup 

The MNIST database contains 60,000 training images (28 × 28 pixels), and the test set has another 10,000
images. During the training process, all the images in the training set were presented sequentially in
each epoch in a randomized order. The task was to identify the label in {0, 1, ..., 9} using a BMNN
classifier trained by the EBP algorithm. The label is encoded as $y_k = 2\delta_{k,\mathrm{label}+1} - 1$, so that the output unit
of the correct digit has target +1 and all other units have target −1. We pre-process
the training data by centering (mean = 0) and normalizing (std = 1) the pixels, as recommended
for BackPropagation (LeCun et al., 2012). As is standard for classification with real-valued MNNs,
the output neuron with the highest value indicates the predicted label of the input pattern.
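The label encoding, the pixel standardization and the prediction rule described above can be sketched as follows. The choice of per-pixel statistics computed on the training set, the guard for constant pixels and the random stand-in data are assumptions made for this illustration; the paper does not spell out these details.

```python
import numpy as np

def encode_labels(labels, num_classes=10):
    """y_k = 2 * delta_{k, label + 1} - 1 (k is 1-based in the text, 0-based here)."""
    labels = np.asarray(labels)
    Y = -np.ones((labels.size, num_classes))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

def standardize(train, test, eps=1e-8):
    """Center and scale every pixel using training-set statistics (assumption)."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std < eps, 1.0, std)       # leave constant pixels unscaled
    return (train - mean) / std, (test - mean) / std

def predict(outputs):
    # The output neuron with the highest value gives the predicted digit.
    return np.argmax(outputs, axis=1)

rng = np.random.default_rng(0)
train = rng.random((100, 784))                # stand-in data, not MNIST
train[:, 0] = 0.0                             # a constant (always black) pixel
test = rng.random((20, 784))
train_z, test_z = standardize(train, test)
Y = encode_labels([3, 0, 9])
print(Y[0], predict(Y))
```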
Table 1: Network architectures in experiments

Method          # of Hidden Layers   # of Hidden Units
1D Vector       One                  200; 400; 600; 800; 1000
                Two                  [200, 200]; [400, 400]; [600, 600]; [800, 800]
2D Convolving   One                  144; 169; 196; 225; 256; 289
                Two                  [361, 100]


When treating the image as a 1D vector, a constant 1 is added to each input vector to allow some bias
to the neurons in the hidden layer (so $V_0 = 785$). For the spatial filtering method, a bias is added to
each convolving block. Two neural network architectures are used: one hidden layer and two hidden
layers. For each type of architecture, we vary the number of neurons in the hidden layers. The
detailed configurations of the network architectures for both methods are shown in Table 1. In the
spatial filtering method, different filtering block sizes are used in the one-hidden-layer architecture:
12 × 12, 13 × 13, 14 × 14, 15 × 15, 16 × 16 and 17 × 17. The corresponding numbers of hidden units are 289,
256, 225, 196, 169 and 144, which are the feature map sizes in the hidden layer. Taking block size
12 × 12 as an example, the feature map size becomes (28 − 12 + 1) × (28 − 12 + 1) = 17 × 17 = 289
hidden units. Accordingly, there are 12 × 12 = 144 inputs to each unit in the hidden layer, and 289
inputs to each unit in the output layer. We selected these configurations because of the large "fan-in"
assumption of the EBP algorithm. These configurations can also be used to learn whether it is
better to set a larger fan-in in the first layer or in the second layer. In the case of the two-hidden-layer network,
we select only one configuration because other configurations would lead to a smaller fan-in (the hidden
units [361, 100] correspond to 100 inputs to each unit in the first hidden layer (10 × 10 block size in the
input layer) and also 100 inputs (10 × 10 block size in the first hidden layer) to each unit in the second
hidden layer).
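The relation between block size, fan-in and number of hidden units can be checked with a few lines of arithmetic. The helper below is a hypothetical name introduced for this sketch; it assumes a square block that slides with a stride of one pixel, as in the 12 × 12 example above.

```python
def feature_map_units(image_size, block_size):
    """Number of units in a feature map for a square block sliding with stride 1."""
    side = image_size - block_size + 1
    return side * side

# One-hidden-layer configurations on 28 x 28 inputs.
table = {b: feature_map_units(28, b) for b in range(12, 18)}
for b, units in table.items():
    print(f"{b} x {b} block: fan-in {b * b}, {units} hidden units")

# Two-hidden-layer configuration: 10 x 10 blocks applied twice.
first = feature_map_units(28, 10)    # 19 x 19 units
second = feature_map_units(19, 10)   # 10 x 10 units
print(first, second)
```

Larger blocks increase the fan-in of the hidden units but shrink the feature map, which in turn reduces the fan-in of the output layer.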

We also employ the dropout technique on all the architectures. Dropout is a technique for preventing
overfitting and provides a way of approximately combining exponentially many different network
architectures efficiently to improve performance (Srivastava et al., 2014). The effectiveness of dropout
has been demonstrated on neural networks, DBN and DBM trained with traditional error backpropagation
with the stochastic gradient descent method (Srivastava et al., 2014). In this study, we investigate its
effectiveness in the EBP algorithm. In the experiments, we fixed p = 0.8 for both hidden units and
input units in all dropout nets.
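A minimal sketch of how dropout masks with p = 0.8 could be sampled is given below. It follows the convention of Srivastava et al. (2014), where p is the probability of retaining a unit. How the masks are combined with the EBP forward and backward passes is not described in the text, so applying them to the input vector and to the hidden-layer means is an assumption of this example.

```python
import numpy as np

def dropout_mask(shape, p_retain, rng):
    """Bernoulli mask: 1 keeps a unit (probability p_retain), 0 drops it."""
    return (rng.random(shape) < p_retain).astype(float)

rng = np.random.default_rng(0)
p = 0.8                                      # retention probability used in the experiments
x = rng.standard_normal(784)                 # stand-in input vector
input_mask = dropout_mask(x.shape, p, rng)
x_dropped = x * input_mask                   # dropped input units contribute nothing

hidden_mask = dropout_mask((200,), p, rng)   # one mask per hidden layer and per sample
print(input_mask.mean(), hidden_mask.mean())
```

A fresh mask is drawn for every training sample; at test time no units are dropped.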

4.2 Experimental Results 

In presenting the results, we use four abbreviations for simplicity: (1) B-EBP-D: Deterministic
EBP (EBP-D, see Sect. 2.1) with binary weights; (2) B-EBP-P: Probabilistic EBP (EBP-P,
see Sect. 2.1) with binary weights; (3) R-EBP-D: Deterministic EBP with real weights; and (4)
R-EBP-P: Probabilistic EBP with real weights. All the results reported below are based on
networks trained for 120 epochs. Training for more epochs may improve the performance of some
network architectures. For weight initialization, we used the same method as Soudry et al. (2014).

Effects of Hidden Unit Number and Hidden Layer Number Table 2 shows the results of MNNs
on the MNIST dataset using EBP algorithms on different network structures without dropout. From the
results, we can observe that for networks with one hidden layer, increasing the number of hidden units clearly
improves the performance, and the best performance is obtained with 800 units. The two-hidden-layer
structure with EBP-P outperforms the one-hidden-layer structure significantly, even with only 200
hidden units in each layer. The results demonstrate that EBP works well on MNNs. Another
observation is that EBP-P outperforms EBP-D, which is consistent with the results shown in Soudry et al.
(2014). In particular, the performance of B-EBP-D in the two-hidden-layer structure is worse than
in the one-hidden-layer structure. As the number of hidden units grows, the performance of B-EBP-D
degrades quickly in two-hidden-layer models (from 5.20% to 27.06% test error). We also use the EBP algorithm with real weights
for all the configurations. The performance of EBP with real weights is better than the
performance of EBP with binary weights in all structures. R-EBP-P with two hidden layers is only slightly
better than with one hidden layer. Although R-EBP-D, like B-EBP-D, performs worse with two hidden layers than
with one, its performance improves when the number of hidden units increases.
The standard BackProp algorithm (using the tanh activation function and an optimized learning rate) on
Table 2: Test errors without dropout for 1D vector method

Hidden units   200     400     600     800     1000    [200, 200]   [400, 400]   [600, 600]   [800, 800]
B-EBP-P        3.46%   3.15%   3.12%   3.01%   3.11%   2.63%        2.61%        2.37%        2.37%
B-EBP-D        4.63%   3.89%   3.62%   3.63%   3.57%   5.20%        5.91%        13.51%       27.06%
R-EBP-P        2.78%   2.29%   2.28%   2.20%   2.25%   2.16%        2.22%        2.22%        2.10%
R-EBP-D        3.04%   2.42%   2.23%   2.25%   2.27%   2.63%        2.59%        2.41%        2.42%


Table 3: Test errors with dropout for 1D vector method

Hidden units   200     400     600     800     1000    [200, 200]   [400, 400]   [600, 600]   [800, 800]
B-EBP-P        3.60%   2.82%   2.82%   2.55%   2.52%   2.93%        2.39%        2.12%        2.12%
B-EBP-D        4.91%   3.50%   3.45%   3.10%   3.08%   3.97%        3.18%        2.89%        2.68%
R-EBP-P        2.45%   2.04%   1.90%   1.87%   1.88%   2.22%        1.78%        1.75%        1.66%
R-EBP-D        2.58%   2.09%   1.94%   1.91%   1.86%   2.51%        1.99%        1.87%        1.75%


the one-hidden-layer model with 800 units can obtain 2.13913, which is comparable to the best
results obtained by R-EBP-P. Using binary weights hurts the performance, but from the table we
can see that binary weights with the best network configuration do not hurt the performance much (the best
performance of B-EBP-P is 2.37%, compared to 2.10% for R-EBP-P).

Effects of Dropout The results of EBP algorithms on different network structures with dropout are
shown in Table 3. The results show the same trends as those obtained without dropout. Comparing
the results in Table 2 and Table 3, we can see that using dropout improves the performance
in most configurations (the exceptions all involve the smallest networks, with 200 units per layer), which demonstrates that dropout also works in the EBP algorithm. From
the results of using 1000 units and 800 units in the one-hidden-layer structure, we can see that without
dropout, the result with 1000 hidden units is generally worse than that with 800 hidden units, while with dropout,
the performance mostly keeps improving when the number of hidden units increases from 800 to 1000.
Besides, with dropout, the performance of B-EBP-D in two-hidden-layer networks becomes reasonable (2.68% for [800, 800], versus 27.06% without dropout). The results validate that
dropout can effectively prevent overfitting in BMNNs with the EBP algorithm.

Effects of Spatial Filtering Table 4 shows the results of MNNs using the EBP algorithm with the
consideration of image spatial configuration. The best performance of the spatial filtering method using
binary weights is 3.56% (obtained with 225 hidden units in the one-hidden-layer structure), which is worse than
the results of the "1D Vector Input" method shown in Table 2. In contrast, the
performance with real weights can be improved by the spatial filtering method, as it is
better than that of all the network structures using the "1D Vector Input" method without dropout (the results
in Table 2). The best results are obtained in the configuration with 256 hidden units (13 × 13 inputs
to each unit in the hidden layer, and 256 inputs to each unit in the output layer). The results of this
method shed light on the extension of the EBP method to Convolutional Neural Networks, for example on the choice of
the block size connecting to each unit in the feature map in each convolutional layer.

Summary The analysis of the experimental results gives us a few interesting findings: (1)
BMNNs with the EBP algorithm work well for the image classification task, although the performance
is not as good as that of real-weight MNNs³; (2) even if the fan-in size is only a few hundred (e.g., [784, 200, 10]),
the EBP algorithm still works well on BMNNs; (3) BMNNs with EBP-P algorithms on networks
² Note that using error regularization and proper weight initialization, standard backpropagation can achieve
better performance. For example, we can achieve a 1.65% error rate by using L1 and L2 error regularization and
initializing the weights uniformly in $[-\sqrt{6 / (fan_{in} + fan_{out})}, \sqrt{6 / (fan_{in} + fan_{out})}]$ with 500 hidden units.

³ Note that the EBP algorithm on MNNs with real weights can obtain results comparable to those of the
standard BackPropagation method.


Table 4: Test errors without dropout for spatial filtering method

Hidden units   144     169     196     225     256     289     [361, 100]
B-EBP-P        4.06%   3.90%   3.87%   3.97%   4.07%   4.36%   4.96%
B-EBP-D        4.31%   3.93%   3.73%   3.56%   3.93%   4.06%   4.87%
R-EBP-P        2.51%   2.21%   2.07%   2.03%   1.87%   1.99%   1.93%
R-EBP-D        2.82%   2.51%   2.18%   2.22%   2.17%   2.08%   2.02%
with two hidden layers (more layers) outperform the networks with one hidden layer; (4) dropout
can significantly improve the performance of BMNNs with the EBP algorithm; and (5) taking
spatial filtering into account does not improve the classification performance of BMNNs, based on
the results on MNIST.

5 Conclusions 

In this paper, we report the performance of binary multilayer neural networks (BMNNs) on image
classification tasks. The Expectation BackPropagation (EBP) algorithm is used to train BMNNs
with different network architectures, and the performance is evaluated on the standard MNIST digits
dataset. Experimental results demonstrate that BMNNs with the EBP algorithm can achieve good
performance on the MNIST classification task. The results also show that the dropout technique
can significantly improve BMNNs with the EBP algorithm. Image spatial configuration improves the
performance of networks with real weights but not that of BMNNs. In this study, we only conduct
experiments on the MNIST dataset. The performance of BMNNs with the EBP algorithm on image
classification tasks needs to be further validated on other image datasets (e.g., CIFAR10). In the
future, we would like to study the performance of standard Convolutional Neural Networks with the
EBP algorithm and to explore different weight initialization methods.